XGBoost (eXtreme Gradient Boosting) is one of the algorithms that most often tops the leaderboards on the Kaggle data-science platform.
This tutorial demonstrates it on the Agaricus mushroom dataset (UCI Machine Learning Repository), predicting whether a mushroom is edible.
xgboost accepts three input formats: matrix, dgCMatrix, and xgb.DMatrix (recommended). The matrix may contain only numeric values; categorical features are not supported and must first be one-hot encoded into dummy variables.
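As a minimal illustration of that one-hot encoding step (using a toy data frame, not the Agaricus set), base R's model.matrix() can expand a factor into 0/1 dummy columns:

```r
# Toy data frame with a single categorical column (not the Agaricus data)
df <- data.frame(color = factor(c("red", "green", "red", "blue")))

# model.matrix() expands the factor into 0/1 dummy columns;
# "+ 0" keeps one column per level instead of dropping a baseline level
X <- model.matrix(~ color + 0, data = df)
X
```

For large sparse datasets, Matrix::sparse.model.matrix() does the same expansion but returns a dgCMatrix directly.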
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
The training data is a list with two elements: the features data and the labels label.
str(train)
## List of 2
## $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
## .. ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
## .. ..@ Dim : int [1:2] 6513 126
## .. ..@ Dimnames:List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
## .. ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ factors : list()
## $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...
data is a sparse matrix:
class(train$data)[1]
## [1] "dgCMatrix"
label holds the outcome for each training example; it is what we ask xgboost to predict:
class(train$label)
## [1] "numeric"
Since the goal is to classify mushrooms as edible or not, we use binary classification with objective = "binary:logistic".
xgboost's parameters are covered in detail in the advanced chapter.
First, feed the dgCMatrix directly into the model:
bstSparse <- xgboost(data = train$data, label = train$label,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-error:0.046522
## [2] train-error:0.022263
Alternatively, convert the data to an xgb.DMatrix first:
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-error:0.046522
## [2] train-error:0.022263
At this point a basic xgboost model has been built.
Next, feed the test set prepared earlier into the model to get predictions.
The output is a probability in [0, 1], which still has to be converted into a final class label:
pred <- predict(bstDMatrix, test$data)
head(pred)
## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391
Here we use 0.5 as the cutoff: probabilities above it become 1, the rest 0, giving the model's predicted classes:
prediction <- as.numeric(pred > 0.5)
head(prediction)
## [1] 0 1 0 0 0 1
Build a contingency table to inspect the predictions:
table(prediction, test$label)
##
## prediction 0 1
## 0 813 13
## 1 22 763
The final accuracy is about 98%:
err <- mean(prediction != test$label)
print(paste("test-acc =", round(1- err, digits = 2)))
## [1] "test-acc = 0.98"
The author's documentation recommends the xgb.DMatrix type: it lets xgboost compute faster and is needed for the advanced operations that follow.
Unlike ordinary data formats, an xgb.DMatrix stores the data and the label separately:
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
label <- getinfo(dtest, "label")
xgb.DMatrix.save(dtrain, "dtrain.buffer")
dtrain2 <- xgb.DMatrix("dtrain.buffer")
Both xgboost and xgb.train are model-training functions.
xgboost is the simple, easy-to-use version for quickly building a basic model; xgb.train exposes more tunable parameters and accepts only xgb.DMatrix input.
watchlist <- list(train=dtrain, test=dtest)
bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2,
watchlist=watchlist, objective = "binary:logistic")
## [1] train-error:0.046522 test-error:0.042831
## [2] train-error:0.022263 test-error:0.021726
# Multi Evaluation Metrics
bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2,
watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
## [1] train-error:0.046522 train-logloss:0.233376 test-error:0.042831 test-logloss:0.226686
## [2] train-error:0.022263 train-logloss:0.136658 test-error:0.021726 test-logloss:0.137874
train-error and test-error should come out close to each other; if they differ noticeably, re-examine how the training and test data were split.
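If the split does need to be redone, a minimal base-R sketch of a random 80/20 split (the seed and sizes are arbitrary choices for illustration):

```r
set.seed(42)                          # arbitrary seed, for reproducibility
n         <- 100                      # pretend the dataset has 100 rows
train_idx <- sample(n, size = 0.8 * n)        # 80% of row indices for training
test_idx  <- setdiff(seq_len(n), train_idx)   # the remaining 20% for testing
length(train_idx)
length(test_idx)
```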
Besides the built-in metrics, you can also write your own objective and evaluation functions:
# custom objective: gradient and hessian of the logistic loss
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1/(1 + exp(-preds))    # raw margin -> probability
  grad <- preds - labels          # first derivative of the log-loss
  hess <- preds * (1 - preds)     # second derivative of the log-loss
  return(list(grad = grad, hess = hess))
}
# custom evaluation metric (named "MSE" here, though the value computed is actually the RMSE)
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((preds - labels)^2))
  return(list(metric = "MSE", value = err))
}
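The grad returned by logregobj is the derivative of the log-loss with respect to the raw margin; a quick base-R sanity check against a central finite difference (toy values, independent of xgboost):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# log-loss of a raw margin `pred` against a 0/1 label
logloss <- function(pred, label) {
  p <- sigmoid(pred)
  -(label * log(p) + (1 - label) * log(1 - p))
}

pred <- 0.7; label <- 1                     # toy values
grad_analytic <- sigmoid(pred) - label      # same formula as in logregobj

eps <- 1e-6                                 # central finite difference
grad_numeric <- (logloss(pred + eps, label) -
                 logloss(pred - eps, label)) / (2 * eps)
c(grad_analytic, grad_numeric)              # the two values agree
```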
dtest <- xgb.DMatrix(test$data, label = test$label)
watchlist <- list(eval = dtest, train = dtrain)
param <- list(max_depth = 2, eta = 1, silent = 1)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, logregobj, evalerror, maximize = FALSE)
## [1] eval-MSE:1.592293 train-MSE:1.595967
## [2] eval-MSE:2.405194 train-MSE:2.409772
Dump the fitted trees as text to inspect each split:
xgb.dump(bst, with_stats = TRUE)
## [1] "booster[0]"
## [2] "0:[f28<-9.53674e-07] yes=1,no=2,missing=1,gain=4254.64,cover=1530.58"
## [3] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3,gain=1230.15,cover=869.044"
## [4] "3:leaf=1.30088,cover=763.292"
## [5] "4:leaf=-2.32415,cover=105.752"
## [6] "2:[f108<-9.53674e-07] yes=5,no=6,missing=5,gain=212.286,cover=661.535"
## [7] "5:leaf=-2.58465,cover=649.08"
## [8] "6:leaf=1.48713,cover=12.4552"
## [9] "booster[1]"
## [10] "0:[f59<-9.53674e-07] yes=1,no=2,missing=1,gain=911.658,cover=726.268"
## [11] "1:[f28<-9.53674e-07] yes=3,no=4,missing=3,gain=493.603,cover=707.047"
## [12] "3:leaf=0.77417,cover=429.742"
## [13] "4:leaf=-0.934515,cover=277.305"
## [14] "2:leaf=-6.70608,cover=19.2211"
xgb.plot.tree(model = bst)
Check how much each feature in the dataset contributes to the predictions:
importance_matrix <- xgb.importance(model = bstDMatrix)
print(importance_matrix)
## Feature Gain Cover Frequency
## 1: odor=none 0.67615484 0.4978746 0.4
## 2: stalk-root=club 0.17135352 0.1920543 0.2
## 3: stalk-root=rooted 0.12317241 0.1638750 0.2
## 4: spore-print-color=green 0.02931922 0.1461960 0.2
xgb.plot.importance(importance_matrix = importance_matrix)
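The Gain column reports each feature's normalized contribution, so the values should sum to 1; a quick check using the numbers printed above:

```r
# Gain values copied from the importance matrix printed above
gain <- c(0.67615484, 0.17135352, 0.12317241, 0.02931922)
sum(gain)   # normalized contributions, so this is (numerically) 1
```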
The exercises above used tree-based boosting; xgboost also provides a linear booster.
Linear boosting suits simple datasets and captures linear relationships effectively, but tree boosting usually performs better because it can find non-linear relationships between variables; it is worth trying both and comparing.
The only changes are setting booster = "gblinear" and removing the eta parameter:
bst <- xgb.train(data=dtrain, booster = "gblinear",
max_depth=2, nthread = 2, nrounds=2,
watchlist=watchlist,
eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
## [1] eval-error:0.005587 eval-logloss:0.162564 train-error:0.006909 train-logloss:0.162310
## [2] eval-error:0.000000 eval-logloss:0.056521 train-error:0.001842 train-logloss:0.059108
Finally, save the model to disk so it can be loaded back later:
xgb.save(bst, "xgboost.model")
## [1] TRUE
bst_load <- xgb.load("xgboost.model")
pred2 <- predict(bst_load, test$data)